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Semi-parametric item response functions in the context of guessing 

Abstract 

We present a logistic function of a monotonic polynomial with a lower asymptote, al- 
lowing additional flexibility beyond the three-parameter logistic model. We develop a 
maximum marginal likelihood based approach to estimate the item parameters. The 
new item response model is demonstrated on math assessment data from a state, and 
a computationally efficient strategy for choosing the order of the polynomial is demon- 
strated and tested. 

Keywords: filtered polynomial, guessing, item response theory 



March 31, 2015 


2 


1 Introduction 

In item factor analysis, the three-parameter logistic (3PL; Birnbaum, 1968) model is 
a commonly used item response model that includes a lower asymptote (top panel in 
Figure 1). This item model can be particularly useful in educational assessments when 
the lower asymptote represents the probability of correctly "guessing" the answer to a 
multiple choice question. But, what happens when there is a non-zero probability of 
guessing and the item response function (IRF) that generated the data does not neatly 
follow this functional form (bottom panel in Figure 1)? 

It is known that using standard item response models such as the two-parameter lo- 
gistic (2PL; Birnbaum, 1968) and generalized partial credit models (GPC; Muraki, 1992) 
when the true IRF follows a different shape can result in poor recovery of the response 
function, item or model misfit, and suboptimal scoring of individuals' proficiency (e.g., 
Falk & Cai, in press; Liang, 2007; Liang & Browne, 2015; Ramsay & Abrahamowicz, 
1989). We expect these same problems to occur when forcing a 3PL model on item re- 
sponse data that could have been generated by an IRF with a different functional form. 
Remedies to this problem in general include approaches that allow for more flexibility 
in the estimated IRF. For example, non-parametric methods may quickly estimate an IRF 
with good recovery, but with the caveat that the resulting IRF may not be monotonically 
increasing (Ramsay, 1991). Bayesian non-parametric methods may also be employed 
(Duncan & MacEachern, 2008, 2013; Miyazaki & Hoshino, 2009; Qin, 1998), but may be 
slow to estimate (Liang, 2007). 

The present research builds upon recent semi- (or quasi-) parametric item response 
functions that are built by replacing the linear predictor of standard response functions 
with a monotonic polynomial (Falk & Cai, in press; Liang, 2007; Liang & Browne, 2015). 
So far, these approaches have been demonstrated to allow more flexibility to the 2PL 
and GPC item models, and can result in better recovery of IRFs and latent proficiency. 
Specifically, we will show how a logistic function of a monotonic polynomial with a lower 
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asymptote (LMPA) can allow for greater flexibility than the standard 3PL. In addition, 
the item model reduces to the 3PL at the lowest-order polynomial. 

An additional issue we address is the selection of the order of the polynomial for each 
item. Previous research has so far relied mostly on using information criteria (e.g., AIC 
or BIC) for selecting the order of the polynomial, which requires refitting the model. For 
example, Falk and Cai (in press) used an AIC step-wise approach where at each iteration, 
each item in turn was considered as a candidate for being modelled as a higher order 
polynomial. The item that improved AIC the most was selected as having a higher- 
order polynomial, and the process then repeats until no progress can be made. When 
using in conjunction with the EM algorithm (e.g., Bock & Aitkin, 1981), this process 
can be computationally prohibitive with a long test. Thus, the most items used by 
Falk and Cai (in press) was 20. Use of a different estimation approach may be faster 
computationally (Liang, 2007; Liang & Browne, 2015), but has not been demonstrated for 
cases where multiple group analyses are employed or the dataset contains missing data. 
As an alternative, we suggest that candidate items may be identified without additional 
model fitting by considering item fit statistics such as S — X 2 (Orlando & Thissen, 2000, 
2003). Since S — X 2 has been successfully used in previous research to identify atypical 
IRFs, its use has potential in item screening and may result in refitting the model a fewer 
number of times. That is, only items that fit poorly according to S — X 2 may be good 
candidates for modeling with higher-order polynomials. Since it is possible that as part 
of a testing program poor item fit may be used as a flag for deactivating items, this 
approach also has an advantage if it is able to reduce the number of poorly fitting items. 

The remainder of this manuscript is organized as follows. In Section 2, we present 
the proposed item model, LMPA. Section 3 presents an illustration using a large-scale 
state math assessment that uses S — X 2 to screen items before modeling with the LMPA 
with higher-order polynomials. Section 4 then presents simulation results illustrating 
the ability of the LMPA item model to improve IRF recovery in conjunction with S-X 2 
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guided polynomial order selection. Finally, Section 5 contains concluding remarks. 

2 The proposed item response model 

2.1 Logistic function of a monotonic polynomial with asymptote (LMPA) 

To introduce notation, consider i = 1, 2, . . . , N examinees complete j = 1, 2, . . . , n 
dichotomously scored items, with observed item responses i/,y G [ 0 1 ]• One way of 
writing the 3PL model for the "correct" response is as follows: 


^( 1 1 @i> K j> ^j, 7/) 


c ( K i ) + 


1 - fpCy) 

1 + exp( — (Sj + jjOi)) 


(1) 


where Sj and 7 j are the intercept and slopes, respectively. And the pseudo-guessing 
parameter c(Kj) determines the lower asymptote, which is an estimate of the proportion 
of examinees in the latent class adopting the "guessing" strategy, itself reparameterized 
as a function of Kj to allow unconstrained estimation: 


c{Kj) 


1 

1 + exp(-JCy)' 


( 2 ) 


To allow one or more "bends" in the 3PL model, we can replace the slope with a 
monotonic polynomial function for item j: 


P(l\di,6j,<Vj, CCj, Tj) 


c (Kj) + 


1 - c(*/) 

1 + exp (-(Sj + nij(6i, coj, Kj, Tj))) 


(3) 


Here the polynomial (omitting the intercept) is of order 2k + 1, and its derivative with 
respect to 6 is of order 2k, 


m(d,co,tx. r r) = m(6, b) = b\6 + b 2 0 2 + • • • + & 2 fc+i 0 2k+1 
m (0, a) = Ufl T ci\0 T T • • • T fl 2 ^0 2 ^ 


(4) 

(5) 


where A: is a user-specified non-negative integer, which may vary across items. Thus, if 
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k = 0, the model reduces to the 3PL. To ensure monotonicity of the polynomial (i.e., the 
function increases as 6 increases), the derivative is parameterized such that it is always 
positive, which entails the coefficients b = [ b\ ■ ■ ■ fo 2 fc+i ] are a complicated function 
of the parameters co, ct, and r. That the polynomial is monotonically increasing leads 
to a monotonically increasing response function for the correct response. Details of this 
parameterization are given by Falk and Cai (in press) and are due much to the hard work 
of others (Liang, 2007; Elphinstone, 1985). In brief, the coefficients a k = [ uq • • • a 2k ]' 
can be represented in recursive form as: 


a fc = T fc a fc _i = T k T k _ x ■ ■ • T 2 T 2 exp(m) 
where the (2k + 1) x (2k — 1) matrix T k contains only cc k and r k : 


(6) 


T k 


2.2 



1 

0 

0 

0 

0 

0 


—2oi k 

1 

0 

0 

0 

0 


a fc + ex p(Tjfc) 

2 M-k 

1 

0 

0 

0 

= 

0 

oil + ex P( T fc) 

-2a k ■ ■ 

0 

0 

0 


0 

0 

0 

• + ex p( T fc) 

2 oc k 

1 


0 

0 

0 

0 

ocl + exp(r k ) 

—20Ck 


0 

0 

0 

0 

0 

ocl + exp(r k 

We can then easily obtain the coefficients b of the polynomial from a by noticing that 

= for u = 1 , . . . ,2k + 1 . 


Example response functions 


The example response functions in the bottom panel of Figure 1 were based on the 
LMPA model, with k = 1 and k = 2, and their item parameters reported in Table 1. 
These response functions were obtained by analyses conducted on a state math assess- 
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merit discussed in Section 3. Note that higher-order polynomials tend to result in more 
flexibility, but could be prone to overfitting noise in the data. 

Non-standard response functions such as these can be the result of a number of 
different reasons. In the context of educational assessments, we propose that data from a 
heterogeneous population could result in non-standard response functions. Suppose our 
data come from two different unidentified groups that do not differ in overall proficiency, 
but differential item functioning exists for a particular item. That is, the item is easier 
for Group 1 than Group 2 (see Figure 2). Such a case could happen in practice if, 
for example, the item has unique/ specific content that some students were exposed to 
through instruction, whereas other students were not. The data generating model for 
the correct response to this item is then a mix of the two group's response functions: 

P(l\6) = p 1 P 1 (l\6) + p 2 P 2 (l\6) (7) 

where p\ and pi = 1 — Pi are the proportion of respondents in each group, and Pi and 
P 2 are short-hand for the response functions for Group 1 and 2, respectively. Suppose 
item parameters of k = —1.39 and co = .8 for both groups, but 5 = —4 for Group 1 
and <5 = 4 for Group 2. If we use mixing proportions of 70% and 30% for Groups 1 
and 2, then this results in response functions with non-standard bends (see Figure 2). 
Although this response function does not come from the LMPA model, the LMPA can 
still provide a good approximation to it. To illustrate. Figure 3 displays the same mixture 
IRF (black line) with the best-fitting 3PL line (blue line; left panel) and LMPA line (blue 
line; right panel) with only a 3rd-order polynomial (k = 1). Although the LMPA does 
not completely overlap with the mixture IRF, the two are almost indistinguishable. Thus, 
while some approaches do more closely represent the mixing of IRFs or their parameters 
to achieve flexible IRF shapes (e.g., Duncan & MacEachern, 2008, 2013; Miyazaki & 
Hoshino, 2009) or explicit identification of latent classes (e.g., Rost, 1990), the LMPA 
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approach may provide a reasonable alternative solution that can accommodate such 
shapes. Importantly, it may also serve as a leading indicator followed by more detailed 
and computationally-demanding analysis. 

2.3 Estimation and the use of stabilizing prior distributions 

As Falk and Cai (in press) did for other models including the monotonic polynomial, 
we used the Bock and Aitkin (1981) machinery of maximizing the marginal likelihood 
using the EM algorithm for use with the LMPA approach. Full first- and second-order 
derivatives of the LMPA model necessary for estimation are given in Appendix A. Since 
the LMPA model adds additional parameters to the 3PL, which is known to be difficult 
to estimate, prior distributions were placed on some model parameters to provide addi- 
tional stability. Technically, our approach is Bayesian. It constitutes finding the mode of 
the posterior distribution for model parameters, though we note this approach is com- 
mon in estimation of the 3PL (Bock, Gibbons, & Muraki, 1988; Cai, Yang, & Hansen, 2011; 
Mislevy, 1986). It may be best to understand the effects of these priors as stabilizing soft 
constraints on gradients and ridging /conditioning constants on the Hessian, without 
completely abandoning the operational efficiencies of maximum marginal likelihood. 

Specifically, we used diffuse normal priors on all a and r parameters as also done 
by Falk and Cai (in press), such as n(a) ~ A/"(0,50) and tc{ r) ~ N{— 1,50). To provide 
stability to the lower asymptote, we use a prior such as tc(k) ~ Af(— 1.39, .25), where 
c(— 1.39) .20, or the rate of guessing we might expect from a five option multiple 
choice question (see also Cai et al., 2011). The final prior used involves that analogous to 
placing a Beta prior on item uniqueness to prevent Heywood cases in item factor analysis 
(e.g., Bock et al., 1988; Mislevy, 1986). We show how to adapt this prior developed under 
multidimensional IRT to the current setting. 1 

Suppose we derived the LMPA item model using a probit rather than a logistic dis- 

1 A version of the LMPA model that omitted the Beta prior was presented at the 2013 International Meet- 
ing of the Psychometric Society. Omission of the Beta prior, a weaker prior on k, and little troubleshooting 
of starting values tended to yield mediocre performance of the LMPA in simulations. 
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tribution. For example, underlying the observed response, y/y, is a variable that is a 
function of a monotonic polynomial, 

2k+l 

y* = m(9, Ay) + £y = ^ ^jq^ + £ j (8) 

< 7=1 

We may further assume that ey ~ A/"(0, ipj), and that 9 ~ Af(0, it 2 ), usually with a 2 = 1. 
Dichotomous responses are produced from y* via: 


| 1, if V-, > r, 
\ 0, if y*j<rj 


(9) 


where ry is the threshold for item j. The probability of a correct response under this 
model may be written as: 


P*(1|0) 


c ( K i ) + 


1 ~ C_{Kj) 
xpjV2n 





m{9, Ay) 

<Pj 


2 


c(Ky) + (l-c(Ky))<f> 



m(0,Ay)\ 

"W / 



( 10 ) 

( 11 ) 


where <3> is the standard cumulative normal distribution function. This resembles stan- 
dardized version of the normal ogive model with guessing (e.g., Bock et al., 1988), but 
with the linear predictor replaced by a monotonic polynomial. The goal in remedying 
Heywood cases in this situation involves preventing the unique variance ipj from getting 
too small, or alternatively the steepness of the function implied by m{6, Ay) from getting 
too large. If we assume a standard normal distribution for y* and uncorrelated 9 and ey, 
it follows that 


ipj = 1 — var(m(0. Ay)) 


( 12 ) 
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Thus, ipj is not directly estimated, but is a function of m(6, A y). We may also approx- 
imate ipj as a function of the parameters implied by the LMPA: 


^ 1 + (1/D 2 )var(m(0,by)) 


(13) 


where D = 1.702 is the usual scaling constant. Placing a Beta prior on ipj, such as 
Tc(ipj) ~ B(p, q)- 1 (ipj)?- 1 ( 1 — i pjy- 1 , where B is the Beta function and with q = 1, 
effectively prevents ipj from being too small (e.g., Bock et al., 1988), and results in 
changes to the posterior mode for the item parameters in m(0, by) = m(6, coj, ay, r y). This 
strategy of developing weakly informative priors is not atypical in Bayesian inference, 
where one may choose to motivate the choice by imposing a prior on an alternative 
parameterization of the model (the item unique variance in this case), which then induces 
the desired prior on parameters of interest (the slopes). A typical choice for p is 1.5 (Cai 
et al., 2011), which we employ throughout this paper, and further details as to how the 
Beta prior changes the log-likelihood and derivatives are given in Appendix B. 

3 Example: Large-scale state math assessment 

To illustrate the LMPA item model, we utilized data from 10,000 Grade 4 students 
who provided responses to 44 dichotomously scored items on a state math assessment. 
The 3PL model and all LMPA models described in this section used priors as described 
in the previous section. As a preliminary check, use of a 3PL model for all items (AIC 
= 48,5581, BIC = 48,6533) indicated better fit than using a two-parameter logistic model 
for all items (AIC = 48,7290, BIC = 48,7924), and the mean pseudo-guessing parameter, 
c(k), for the 3PL model was approximately .17. We therefore concentrated on the 3PL as 
our baseline model. 

3.1 Screening of item fit 

The next step was to determine whether the LMPA model may be useful in improving 
model and/or item fit. For this task, we utilized S — X 2 using the version outlined by 
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Orlando and Thissen (2000). In brief, by conditioning on sum-score based groups, the 
test statistic detects whether observed counts of correct/ incorrect are congruent with the 
expected counts based on the model. Specifically for item j, 

, ^ (On - E jt ) 2 

S-X =r (14) 

where the sum-score groups range from 0, 1, . . . , v with v being the maximum observed 
sum-score, Nt is the number of respondents in sum-score group t, Op is the observed 
proportion of correct responses, and Ey f is the expected proportion of correct responses. 

in turn can be computed using the Lord-Wingersky algorithm (Lord & Wingersky, 
1984) as described by Orlando and Thissen (2000). The statistic is compared to a central 
chi-square distribution with degrees of freedom of df = t — 1 — Zj where Zy is the number 
of estimated parameters for item j. If adjacent sum score groups are collapsed due to 
low expected counts (usually less than 1), additional df adjustments are made. 

The procedure for screening items for misfit for the empirical example, and in simu- 
lations, was as follows, using the baseline (3PL) model as the starting model. 

1. Compute S — X 2 for all items in the current model. 

2. Flag all items with misfit below a threshold (e.g., p < .05). 

3. For all flagged items (if any), increase the order of the polynomial (e.g., if k = 0, 
change to k = 1; if k = 1 increase to k = 2). 

4. If any items were changed as a result of Step 3, re-fit the model. 

The Steps 1 through 4 may be repeated as many iterations as desired. For the current 
example and in simulations, we repeated this process only twice, meaning that at most 
the model was fit 3 times (3PL and two cycles of the above steps) and the maximum 
polynomial order for any fitted item was k = 2 (5th order). In addition, for Step 2, it 
is possible to employ either very liberal criteria in screening out items (e.g., no control 
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of Type I error or false discovery rate), or to employ a variety of corrective techniques. 
We experimented with using no correction (referred to hereafter as NC), and p-value 
adjustments using the Benjamini-Hochberg procedure (referred to hereafter as BH; Ben- 
jamini & Hochberg, 1995), which is sometimes advocated for use in IRT contexts as a 
high-power alternative to the Bonferroni method (Thissen, Steinberg, & Kuang, 2002). 
While we may expect the NC approach to lead to overfitting, it will have higher power 
and we later examine in simulations whether such overfitting has averse consequences. 
3.2 LMPA Results 

S — X 2 initially flagged 11 items as poorly fitting under NC and 7 items when using 
the BH correction. Thus, roughly 16% to 25% of items may have poor fit. Flagging and 
fitting items twice under NC resulted in 6 items modeled with k = 1 and 6 items with 
k = 2 using the LMPA model. Under BH, these numbers were predictably lower, with 
5 items as k = 1 and 2 items as k = 2. Both approaches led to fewer items having poor 
fit according to S — X 2 . For instance, the final NC model had 6 items with poor fit and 
the final BH model had only 2. As shown in Table 2, both final NC and BH models also 
improved AIC, with the NC model indicating slightly better fit. However, BIC actually 
preferred the 3PL model. 

We interpret these results as meaning that there is some promise for the LMPA in re- 
ducing the number of poorly fitting items according to S — X 2 . The information criterion 
results are mixed in terms of whether the overall final model is an improvement, with 
AIC suggesting an improvement, but a higher penalization for more model parameters 
under BIC suggesting overfitting. It is therefore difficult to guess whether the LMPA is 
substantially improving IRF recovery. We next turn to simulation results to further test 
the potential impact of the LMPA and our modeling procedure. 
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4 Simulation 
4.1 Method 

We performed a small simulation study to test the performance of the LMPA IRT 
model and the procedure we employed in using S — X 2 to flag candidate items. We used 
a 2 (% non-standard IRF: 20% vs. 40% of items) x 2 (N: 1,000 vs. 5,000) overall design. 
In all cases, we used 40 items, generated latent proficiencies, 6, from a standard normal 
distribution, and performed 100 replications per cell of the design. Studied conditions 
were thus chosen to partially overlap with conditions found with the empirical example. 

Most items were 3PL items (LMPA with k = 0) with item parameters (k, co, and 
S) randomly drawn in each replication from a multivariate normal distribution that 
matched the mean and covariance of the all 3PL model in the previous section. To 
avoid too many extreme items, parameter draws with any item greater than 1.65 SD 
away from its mean were discarded and a new set was drawn. The remaining items 
(8 or 16 items, depending on condition), were randomly constructed from a mixture 
of normal cumulative distribution functions (CDF) and lower asymptotes. Specifically, 

Pligl + (^-gl)^(0\Hl,Crl)) + p2(g2 + (l-g2)^(0\ll2^2)) + P3(g3 + (1 ~ g 3 )^(9\p 3 ^)) , 
with pi and p 2 ~ unif(.2, .4), p 3 = 1 — p\ — p 2 , pi ~ A r (— 1.5, A 2 ), p2 ~ A r (1.5, A 2 ), 
p 3 ~ A/"(0, A 2 ), all g parameters drawn from unif(.l, .3), and <Ji,(J 2 , and <73 indepen- 
dently drawn from a log-normal distribution, lnA/"(— 1.03, .22). Here we use <E> to denote 
the cumulative distribution function along 9 for a given mean and variance. 

To each generated dataset, we mimicked the steps taken in our empirical example. 
That is, we initially fit a 3PL model, followed by flagging items using NC or BH, and 
fitting flagged items with higher-order polynomials. We repeated re-fitting twice for 
both NC and BH, resulting in final NC and BH models. Thus, we are interested in 
comparing the 3PL to the final NC and BH models. Numerical integration necessary for 
estimation was done with 49 equally spaced quadrature points between -5 and 5 across 
6. The maximum number of M-step iterations was set at 50, and the maximum number 



March 31, 2015 


13 


of EM cycles was set at 500. Model estimation terminated if item parameters from one 
EM cycle to the next differed by le-4 or less. 

To further mimic a real data analysis situation, we never started estimation at the 
true model parameters, but rather at k = — 1, 8 = 0, co = 0, a. = 0, and r = —1 for all 
items. However, to automate troubleshooting of estimation problems, we employed the 
following ad-hoc procedure for all replications. Initial starting values for item param- 
eters were further tweaked by using the final estimates of a model run with very few 
maximum M-step iterations and EM cycles (2 and 5, respectively). We then commenced 
with estimation as outlined in the previous paragraph. If the maximum number of EM 
iterations was reached, or if ridging the Hessian ever failed to yield an invertible matrix, 
new starting values were attempted by normalizing sum scores across simulees for use 
in the complete-data likelihood for each item. The item parameter estimates based on 
this complete-data analysis were then used as starting values. 

4.2 Results 

Overall it was our hope and expectation that the procedure of using S — X 2 would 
tend to correctly flag non-standard IRFs (using either NC or BH), which could in turn 
be modelled by the LMPA with higher-order polynomials. This may result in better 
recovery of individual IRFs. 

4.2.1 Ability of S — X 2 to detect non-standard IRFs 

In general, power to detect non-standard IRFs was slightly less than anticipated. 
Table 3 displays power and false positive rates based on the final NC and BH models 
for all cells in the design. Power is the proportion of items with non-standard IRFs 
that were fitted with the LMPA model with k > 0, whereas the false positive rate is the 
proportion of items with a true 3PL shape fitted by the LMPA model with k > 0. Both 
were computed across all replications in each cell. Thus, power to detect non-standard 
IRFs only reaches a maximum of .36 or .33 for NC and .14 for BH (both when N = 5, 000). 
This amounts to correctly detecting approximately 3/8 items for NC and 1/8 items for 



March 31, 2015 


14 


BH when 20% of items have non-standard IRFs, or 5 to 6/16 items for NC and 2/16 
items for BH when 40% of items have non-standard IRFs. 

While we do not immediately have a full explanation, we note that (Orlando & 
Thissen, 2003) found the lowest power for S — X 2 for items that had a non-standard 
bend or a plateau in the middle of the IRF - visually similar to the items we generated 
- versus items that had non-monotonicity or an omitted asymptote. Since non-standard 
IRFs were generated via a random process, it is not necessarily the case that all items 
would in fact look very extreme. This low power rate may not necessarily be problem- 
atic, however, to the extent that both approaches may flag the worst-fitting items and 
flag non-standard items at a higher rate than 3PL items. 

4.2.2 Recovery of IRFs 

We assessed recovery of IRFs by using Root Integrated Mean Square Error (RIMSE; 
e.g., Liang, 2007; Ramsay, 1991), defined as the following for item j: 


RIMSE/ = 


\ e, q =i m) 


1/2 


x 100 


(15) 


where P(1|0) and P(l\9) are short-hand for the estimated and true response function for 
the correct response, (p is a standard normal density function, and the sum is across a 
series of Q equally spaced quadrature points along 6 (-5 to 5 in .1 increments). RIMSE 
thus represents the root of a weighted average squared discrepancy between the true 
and estimated response functions, with more weight given to discrepancies towards the 
middle of the 6 distribution. Lower values indicate better IRF recovery. This index 
was computed for all estimated items and was aggregated across replications and items 
where necessary. 

Since NC and BH approaches did not always suggest poorly fitting items for each 
replication, we separately compared each of NC and BH to the 3PL model, focusing only 
on those replications where the final BH and NC models differed from the 3PL. Overall, 
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the NC model slightly improved RIMSE versus the 3PL model across all conditions (see 
Table 4), with gains more prominent at the larger sample size (N = 5000), perhaps 
in part to higher power to flag non-standard IRFs. Examining items where the true 
underlying IRF was either a 3PL or a non-standard item revealed that all of these gains 
were because of improvement in recovering non-standard items. In fact, recovery of 3PL 
items was slightly worse for the NC model, though perhaps negligibly so. 

Overall, the BH model also slightly improved RIMSE versus the 3PL model across all 
studied conditions (see Table 5), in a similar pattern to NC. Gains in RIMSE were also 
due to better modeling of non-standard items. Albeit based on a much smaller number 
of replications, use of BH appeared to have no averse impact on the recovery of 3PL item 
IRFs, in slight contrast to using NC. 

5 Discussion 

We have presented a new IRT model, the logistic function of a monotonic polynomial 
with asymptote (LMPA), that can account for guessing as does the 3PL, but has a more 
flexible IRF shape. We proposed a possible data generating mechanism that may yield 
such non-standard IRFs - heterogeneous populations whose IRFs are mixed together in 
generating the item responses. Finally, we tested use of S — X 2 to flag candidate items 
for use with the LMPA model in an empirical example and through simulations. 

Our empirical example suggests that use of S — X 2 and LMPA can improve the num- 
ber of well-fitting items without it being necessary to employ a computationally inten- 
sive AIC or BIC step-wise approach. For instance, the number of poorly fitting items 
was reduced by approximately half or better, depending on whether NC or BH p-values 
under S — X 2 were examined. Such initially poor-fitting items may be those that show 
instructional sensitivity in cases where curriculum and instruction is not implemented 
in a fully standard way across the entire population. Our procedure thus serves the dual 
purpose of identifying potential items, and allowing retention of such items (which can 
be expensive to develop) in a large-scale assessment if practice dictates that poor fitting 
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items are flagged for possible deactivation. 

Our simulation results suggest that the procedure utilized on our empirical example 
is sound and can lead to better IRF recovery. This was especially the case with a higher 
sample size, and to a certain extent cases where more items had true non-standard IRFs. 
Interestingly, however, it is difficult to make a strong recommendation on whether no 
p-value correction for S — X 2 should be employed or whether to use the Benjamini- 
Hochberg (Benjamini & Hochberg, 1995) correction. Both yielded comparable gains in 
RIMSE. If parsimony is preferred, the BH approach would yield fewer items modelled 
using the LMPA approach. 
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Appendix A. Derivatives for the LMPA model. 

We used complete first- and second-order derivatives of the LMPA model in the 
M-step of the EM algorithm. We drop item and respondent subscripts and adopt the 
following short-hand notation for the response function in Equation 3: 


P 


/ \ 1 - c(k) 

CW + 1 + exp (-(<$ + m(0))) 


(16) 


where m{6) is short-hand for the monotonic polynomial in Equation 4. We also define 
Q = 1 — P as the probability of an incorrect response. The complete-data log-likelihood 
for a single item and response can then be written as: 


' = yiog(P) + (l — y)iog(i-P) 


(17) 


where y is the observed response. Differentiating (17) with respect to any item parame- 
ter, ?/, leads to: 


dl _ y — PdP 
dt] PQ dy 

First-order derivatives are obtained by substitution for 

^ = C ( K )(1 - c(k))( 1 - c(m(6 ))) 

= (1 - c{K))c{m{6)){l - c(m(6))) 
|^ = (1 - c (K))c(m(0))( 1 - 
^ = (1 - c(K))c(m(e))(l - 
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where 

dm(9) 

da 

da 

d co 

da 

doc s 

da 

dr s 


1 

q^ 

MH 

q^ 

to 

5» 3 ■ 

1 a 2fc+l 
2k+l v 

TfcTfc-i • • 

• T 2 T 1 exp(o;) 


dT s 

■ ■ T 2 T 1 exp(o; 

TfcTfc-l ’ ’ 

da s 

TfcTfc— i • • 

dT s _ 

dT s 

• • T 2 T 1 exp (a; 


and derivatives of the matrices T are as follows: 


0 0 0 

-2 0 0 

2 OLjs -2 0 

0T S _ 0 2 Pij s -2 

doij s : : : 

0 0 0 

0 0 0 

0 0 0 


0 0 0 

0 0 0 

0 0 0 

0 0 0 

2a. j s —2 0 

0 2a j s —2 

0 0 2a j s 


3Ts 

dr -j s 


0 0 0 ••• 0 0 0 

0 0 0 ••• 0 0 0 

exp (tj s ) 0 0 • • • 0 0 0 

0 exp (rjs) 0 • • • 0 0 0 

0 0 0 • • • exp(ry s ) 0 0 

0 0 0 • • • 0 exp(ry s ) 0 

0 0 0 • • • 0 0 exp(ry s ) 
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To obtain second-order derivatives, we add subscripts to t] and differentiate again, 

a 2 / a rapy-p' 

df]xdf]2 dill [df]2 PQ 

d 2 P \y-P 1 dP dP T_ 1 _(y-P)(Q-P)' 

a//ia//2 L PQ J + dy 2 diii [ PQ (PQ) 2 

Again, we substitute for each type of item parameter: 


a 2 p 

died*; 

a 2 p 

dxdS 

a 2 p 

dxdco 

a 2 p 

dKda s 

a 2 p 

dKdT s 

a 2 p 

a^w 

a 2 p 

dSdco 

a 2 p 

dSd(X s 

a 2 p 

dSdr s 

a 2 p 

devdev 

a 2 p 

dcodiXg 

a 2 p 

dcodTs 

a 2 p 

dtX s dtX s 

a 2 p 

doi s doit 

a 2 p 

dtXgd Tg 


c(k)(1 - c(jc))( 1 - 2c(k))(1 - c(m(B ))) 

— c(/c)(l — c(7c))c(m(0))(l — c(m(9))) 

-c(k)( 1 - c(jc))c(m(0))(l - c(m(0)))^^|^ 

-c(k)( 1 - c(jc))c(m(0))(l - c(m(0)))^^|^ 

-c(k)( 1 - c(jc))c(m(0))(l - c(m(0)))^^|| 

(1 - c(x))c(m(0))(l - c(m(0)))(l - 2(c(m(0)))) 

(1 - c( K )) C (m(0))(l - c(m(0)))(l - 2 (c(m(0))))^ 

(1 - c( K )) C (m(0))(l - c(m(0)))(l - 2(c(m(0))))^|^ 

(1 - c( K )) C (m(0))(l - c(m(0)))(l - 2(c(m(0))))^|| 

(1 - «,)W«W)(1 - «(«(»))) [d - 2 c(m(9 )))^> ^ + gj 

(1 - c(K))c(m(9))(l - c(«(»))) [d - 2c(*(9)))^^ + gj 

(1 - c( K ))c(m(9))(l - c(m(e))) (1 - 2 c(m( 9 )))^^b + gb 
d - c( K ))c(m(9))(l - c(m(e))) (1 - 2c(m(e)))?gg> + g£ 

d - cWW»(9))d - c(m(0))) [(1 - 2 c(*(»)))^^ + gj 
d - £(k)W*( 8))(1 - c(m(0))) (1 - 2t(«(8))) yy + g^ 
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d 2 P 

dtx s dz t 

d 2 P 

dz s dz s 

d 2 P 

dz s dz t 


(1 — c(jc))c(m(0))( 1 — c(m(0))) 
(1 — c(jc))c(m(0))(l — c(m(Q))) 
(1 — c ( k )) c ( ot ( 0))(1 — c(m(0))) 


(1-2 c(m(0))) I 


Oa, Or# 


doc s dz t 


(1 — 2c(m(0))) - 1 - 


Ore Ot<; 


dz s dz s 


(1 — 2c(m(0))) ^ I 


dz t Otc 


dz s dz t 


Where second-order derivatives for m(0) are as follows: 


d 2 m{0 ) _ dm{0) 


dcodco 

d 2 m(0) 

dcoda s 

d 2 m(0) 

dcodzs 

d 2 m(0) 

dot s doL s 

d 2 m{0) 

doc s d(Xt 

d 2 m(0) 

dlXgdZg 

d 2 m{0) 

doc s dzt 

d 2 m(0) 

dz s dz s 

d 2 m{0) 

dzgdzt 


da 

dm(0) 

da. 

dm(0) 

da 

dm(0) 

da 

dm(0) 

da 

dm(0) 

da 

dm(0) 

da 

dm(0) 

da 

dm(0) 

da 


Tyifc-i 

Tyi\_i 

TyT/t-i 

Tyi\_i 

Tyi\_i 

Tyi\_i 

Tyi\_i 

T/cT/t-i 

T/cT/t-i 


T2T1 exp (a; 
dT s 


dOL s 
dTg 

dz s 
d 2 T s 

doigdoig 

9 Ts dTf 

docg doct 

3 2 T s 
dcigdzg 
9 Ts OTf 

0a s 0r f 

3 2 T s 


T2T1 exp(a;) 

T2T1 exp(a;) 

T 2 Ti exp(a;) 

T2T1 exp(a?) 


T2T1 exp(a;) 

T2T1 exp(w) 


dzgdzg 

dT s dT 


■ ■ ■ T2T1 exp(cx) 


dzg dzt 


f •••T 2 T 1 exp(cx) 
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Finally, second-order derivatives of the matrices T are the following: 


a 2 T s 

dtx s da. s 


0 0 0 
0 0 0 
2 0 0 
0 2 0 

0 0 0 
0 0 0 
0 0 0 


0 0 0 
0 0 0 
0 0 0 
0 0 0 

2 0 0 
0 2 0 
0 0 2 


d 2 T s 

dttgdTs 


= 0 


c) 2 T s 

dr s dr s 


0 0 0 

0 0 0 

exp(r s ) 0 0 

0 exp(r s ) 0 


0 0 0 
0 0 0 
0 0 0 


0 0 0 

0 0 0 

0 0 0 

0 0 0 

exp(r s ) 0 0 

0 exp(T s ) 0 

0 0 exp(r s ) 


A Appendix B. Beta prior with the LMPA model. 

As in the previous Appendix, we adopt the short-hand notation of m(6) for the mono- 
tonic polynomial. According to Equation 13, the unique item variance is approximately 
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the following under the LMPA model: 




1 

1 + (1/D 2 )var(m(0)) 


(19) 


Since we are considering a single latent trait, the variance of the monotonic polyno- 
mial is the following: 


var(m(0)) = b'Tb 


( 20 ) 


where b are the coefficients for the polynomial, and T is a symmetric matrix: 

var(0) 

co v(0, 0 2 ) var (0 2 ) 


( 21 ) 


^ cov(0, 9 2k+1 ) cov(0 2 ,0 2k+1 ) ... var(0 2fc+1 ) J 

The elements of T can be obtained simply by computing the moments of the distri- 
bution for 9, which in our application is a standard normal distribution. Addition of a 
Beta prior ridges the complete-data log-likelihood by the following: 


(-1) log (B(p,q)) + (p- 1) log(i/> 2 ) (22) 


Differentiating this quantity with respect to a typical parameter, rj, we have: 


a(p- i)!og(^ 2 ) = i dtp 2 

dt] ^ ip 2 dt] 

= — (2/D 2 )(p — 1) da db 
1 + (l/D 2 )bTb dt] 3a 


(23) 

(24) 


where ^ for each type of model parameter are given in Appendix A, and is a diagonal 


matrix: 
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1 0 ... 0 

3b _ 0 1/2 ... 0 

da : : 


0 0 ... 1 /( 2 * + 1 ) 


Thus, when a Beta prior is present, complete-data first-order derivatives are adjusted 
by the quantity in (23). Second-order derivatives are adjusted by the quantity on the 
following page, obtained by differentiating (23) a second time. 



d 2 (P ~ 1) l°g( t / ;2 ) 

dr]idf] 2 


( 25 ) 


(2 m{p - 1 )f -{W A )t 5 -^-rb — -rb 


3a 3b, 


3//i 3a 


3a 3b, 


dr /2 3a 


3 2 a 3b rb 

3//i 3//2 3a 


N> 
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Table 1: Item Parameters from Example IRFs 


Parameter 

Item 2 (Jfc = 0 / 3PL) 

Item 24 (Jfc = 0 / 3PL) 

Item 2 (Jfc = 2) 

Item 24 (Jfc = 1) 

K 

-1.91 

-2.14 

-1.81 

-1.59 

s 

-2.19 

.77 

-2.25 

.80 

CO 

.70 

-.54 

1.14 

-.84 

0C\ 



.59 

.40 

&2 



.35 


Tl 



.25 

-1.20 

?2 



-6.47 



Note. IRF = item response function; 3PL = three parameter logistic. 


Table 2: Model overview of empirical example 



3PL 

NC Final Model 

BH Final Model 

# Parameters 

132 

168 

150 

-21ogL 

485317 

485111 

485199 

AIC 

485581 

485447 

485499 

BIC 

486533 

486659 

486580 


Note. 3PL = three parameter logistic; NC = no correction to S — X 2 p- values; BH = 
Benjamini-Hochberg correction to S — X 2 p-values. 
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Table 3: S — X 2 detection rates for the simulation study 


20% non-standard IRFs 40% non-standard IRFs 
N = 1,000 N = 5,000 N = 1,000 N = 5,000 

Power 


NC 

.101 

.361 

.111 

.336 

BH 

.011 

.144 

.009 

.144 

False positive 
NC 

.052 

.055 

.059 

.082 

BH 

.003 

.004 

.002 

.011 


Note. IRF = item response function; NC = no correction to S — X 2 p- values; BH = 
Benjamini-Hochberg correction to S — X 2 p-values. 



Table 4: IRF recovery (RIMSE) for NC from the simulation study 



# rep. 

Overall RIMSE 

3PL items 

Non-standard items 

N = 1,000 

20% non-standard IRFs 
3PL Model 

89 

2.67 

2.17 

4.64 

NC Model 

89 

2.66 

2.24 

4.48 

40% non-standard IRFs 
3PL Model 

95 

3.16 

2.14 

4.67 

NC Model 

95 

3.12 

2.21 

4.48 

N = 5, 000 

20% non-standard IRFs 
3PL Model 

97 

1.62 

1.03 

4.01 

NC Model 

97 

1.50 

1.06 

3.23 

40% non-standard IRFs 
3PL Model 

100 

2.22 

1.04 

3.99 

NC Model 

100 

1.96 

1.09 

3.26 


Note. IRF = item response function; RIMSE = Root integrated mean square error; 3PL = three parameter logistic; NC = no 
correction to S — X 2 p-values; BH = Benjamini-Hochberg correction to S — X 2 p-values. 


N> 
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Table 5: IRF recovery (RIMSE) for BH from the simulation study 



# rep. 

Overall RIMSE 

3PL items 

Non-standard items 

N = 1,000 

20% non-standard IRFs 
3PL Model 

15 

2.71 

2.18 

4.84 

BH Model 

15 

2.66 

2.19 

4.54 

40% non-standard IRFs 
3PL Model 

16 

3.18 

2.16 

4.71 

BH Model 

16 

3.13 

2.18 

4.56 

N = 5, 000 

20% non-standard IRFs 
3PL Model 

67 

1.65 

1.02 

4.14 

BH Model 

67 

1.53 

1.02 

3.58 

40% non-standard IRFs 
3PL Model 

89 

2.23 

1.04 

4.01 

BH Model 

89 

2.04 

1.03 

3.55 


Note. IRF = item response function; RIMSE = Root integrated mean square error; 3PL = three parameter logistic; NC = no 
correction to S — X 2 p-values; BH = Benjamini-Hochberg correction to S — X 2 p-values. 


U> 

o 


March 31, 2015 




fit to math assessment data using 
f the same items fit with the LMPA 
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Figure 2: Mixture of IRFs 
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P(1I0) 
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Figure 3: Best fitting 3PL and LMPA functions to mixture IRF 




9 9 


Note. Black line is mixture IRF. Left panel includes best fitting 3PL (blue), and right panel 
includes best fitting LMPA line (blue) using k = 1. 


